Identification of herbarium specimen sheet components from high‐resolution images using deep learning

Abstract Advanced computer vision techniques hold the potential to mobilise vast quantities of biodiversity data by facilitating the rapid extraction of text- and trait-based data from herbarium specimen digital images, and to increase the efficiency and accuracy of downstream data capture during digitisation. This investigation developed an object detection model using YOLOv5 and digitised collection images from the University of Melbourne Herbarium (MELU). The MELU-trained 'sheet-component' model (trained on 3371 annotated images, validated on 1000 annotated images, run using the 'large' model type, at 640 pixels, for 200 epochs) successfully identified most of the 11 component types of the digital specimen images, with an overall model precision of 0.983, recall of 0.969 and mean average precision (mAP0.5–0.95) of 0.847. Specifically, 'institutional' and 'annotation' labels were predicted with mAP0.5–0.95 of 0.970 and 0.878 respectively. It was found that at least 2000 images needed to be annotated to train an adequate model, likely owing to the heterogeneity of specimen sheets. The full model was then applied to selected specimens from nine global herbaria (Dillen et al., Biodiversity Data Journal, 7, 2019), quantifying its generalisability: for example, the 'institutional label' was identified with mAP0.5–0.95 of between 0.68 and 0.89 across the various herbaria. Further detailed study demonstrated that starting with the MELU-model weights and retraining for as few as 50 epochs on 30 additional annotated images was sufficient to enable the prediction of a previously unseen component. As many herbaria are resource-constrained, the MELU-trained 'sheet-component' model weights are made available and their application encouraged.

Advanced computer vision techniques have the potential to reduce a major bottleneck in biodiversity data digitisation, that is, the manual labour required for extraction of these data. These techniques are increasingly being used to extract text- and trait-based data from specimen images (Carranza-Rojas et al., 2017; Ott et al., 2020; Triki et al., 2022; Younis et al., 2020).
Greater understanding of the accuracy and efficiency of computer vision techniques as applied to different kinds of herbarium specimens is necessary to assess the potential of these methods for data mobilisation.
Herbarium specimens and their associated collection data contain a wealth of biodiversity data: documenting morphological diversity, geographic distributions, biome or vegetation occupancy, and the flowering and fruiting periods of the taxon represented on the specimen, and how these may change over time. These typically dried, pressed plant samples are secured to archival sheets and are accompanied by label(s) on the sheet detailing collector, location and taxon; sheets occasionally contain other elements such as stamps, handwritten notes (outside the label) and accession numbers (Figure 1). Large-scale digitisation efforts are required in order to provide access to herbarium specimen-associated data (Carranza-Rojas et al., 2017) and to ensure these data are FAIR (findable, accessible, interoperable and reusable; Wilkinson et al., 2016). Critical to the success of the digitisation endeavour is an efficient, scalable, adaptable and cost-effective workflow. An 'object to image to data' workflow, which involves the generation of a digital image of the specimen followed by the transcription of data from the digital image, is used in large-scale digitisation initiatives such as that undertaken by the National Herbarium of New South Wales in Australia (Cox, 2022). The visibility of the specimen label data in the corresponding digital image 'allows the data capture process to be undertaken remotely, both in distance and time' (Haston et al., 2015, p. 116). Digitising enables creation of a 'digital specimen' (Nieva de la Hidalga et al., 2020): generating a digital image of each specimen sheet, manually transcribing some or all of the data present on the specimen label into a searchable database, and then sharing that information for reuse via online biodiversity repositories such as the Atlas of Living Australia (ALA; https://www.ala.org.au/), the Global Biodiversity Information Facility (GBIF; https://www.gbif.org/) and iDigBio (https://www.idigbio.org/).
In recent years, research has focussed on optimising specific tasks within such digitisation workflows. Particularly evident is the desire to minimise or remove manual intervention, speed up the process, improve accuracy and reduce costs, particularly with respect to label data transcription (e.g. Granzow-de la Cerda & Beach, 2010; Walton, Livermore, & Bánki, 2020; Walton, Livermore, Dillen, et al., 2020).
Studies have tackled streamlining the imaging process (e.g. Sweeney et al., 2018; Tegelberg et al., 2014) and extending the use of digital images (e.g. Carranza-Rojas et al., 2017; Corney et al., 2018; Triki et al., 2021; Unger et al., 2016; White et al., 2020). The task of interest here is that of harvesting label data from a specimen sheet digital image (SSDI). Important information is held not only on the formal institutional labels but is also present in handwritten notes on the labels and on the specimen sheet itself. The research value of these specimens is maximised when all data present on a specimen and its derived digital image are transcribed verbatim, and those data are then enriched and/or interpreted and recorded in the collection management system, so that specimen data become searchable and available to other researchers. A first step toward reducing the labour-intensive task of initial verbatim data transcription is building a means for artificial intelligence to identify the areas where these data are present on the SSDI.

FIGURE 1 Examples of specimen sheet digital images from the University of Melbourne Herbarium (MELU): (left) MELUM012346a-d (https://online.herbarium.unimelb.edu.au/collectionobject/MELUM012346a); (middle) MELUD121701c (https://online.herbarium.unimelb.edu.au/collectionobject/MELUD121701c); (right) MELUD105252a (https://online.herbarium.unimelb.edu.au/collectionobject/MELUD105252a).
Much of the earlier literature addressing this task concentrates on extracting data from labels via optical character recognition (OCR). Some studies applied OCR software to the whole SSDI (e.g. Drinkwater et al., 2014; Haston et al., 2012; Tulig et al., 2012). Other studies identified the label first and then applied OCR; in these cases, selecting or 'marking up' the label area was either (a) manual (e.g. Alzuru et al., 2016; Anglin et al., 2013; Barber et al., 2013; Dillen et al., 2019; Haston et al., 2015); (b) vaguely described (e.g. Heidorn & Wei, 2008; Takano et al., 2019, 2020); or (c) proposed as future work, that is, not actually implemented (e.g. Haston et al., 2015; Kirchhoff et al., 2018; Moen et al., 2010). Some investigations (e.g. Alzuru et al., 2016; Haston et al., 2015; Owen et al., 2019) demonstrated that applying OCR tools to label-only images was more effective, faster and more accurate than applying OCR tools to the whole SSDI. Owen et al. (2019) took this a step further and found that running OCR over individual text lines cropped from a label image was faster than processing the whole label. These findings reinforce the value of the current research: a semiautomated tool that identifies components of an SSDI, which can then be cropped out and further analysed or transcribed, holds potential to make downstream elements of SSDI data collection more efficient. Automated identification of components of specimen images lends itself to the application of computer vision (CV) models.
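To make this downstream step concrete, the following is a minimal sketch, assuming a label bounding box has already been predicted by a detection model, of cropping that region from an SSDI and passing only the crop to an OCR engine (here pytesseract; the file name and coordinates are hypothetical):

```python
# Sketch: crop a predicted label region from an SSDI and OCR only the crop,
# in the spirit of the label-first approaches reviewed above.
from PIL import Image
import pytesseract

def ocr_component(image_path, box):
    """box = (left, top, right, bottom) in pixels, from an object detector."""
    sheet = Image.open(image_path)
    label = sheet.crop(box)  # label-only image rather than the whole SSDI
    return pytesseract.image_to_string(label)

# Hypothetical usage:
# text = ocr_component("MELUD105252a.jpg", (1500, 2800, 2300, 3400))
```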
In recent years computer vision models have become more sophisticated (for literature reviews see Hussein et al., 2022; Rocchetti et al., 2021; Wäldchen & Mäder, 2018). While some studies have applied CV methods to the analysis of the plant material itself, here the application of that technology to identify label and handwritten data is of most interest. Relevant forms of CV include object detection, classification and semantic segmentation. Semantic segmentation operates at the pixel level (Nieva de la Hidalga et al., 2022; Triki et al., 2022; White et al., 2020), whereas object detection methodology uses bounding boxes. And while there is 'some overlap between semantic segmentation and object detection' (Walton, Livermore, & Bánki, 2020; Walton, Livermore, Dillen, et al., 2020, p. 7), the latter can be used 'to identify and segment the different objects that are commonly found on herbarium sheets' (ibid., p. 7). One such tool is YOLO (You Only Look Once; Redmon et al., 2016). The third version, YOLOv3, was applied to SSDIs by Triki et al. (2020, 2022); in that study, 4000 SSDIs from the Friedrich Schiller University Jena herbarium, Germany (JE), were manually marked up and used to train a model to identify specific plant traits and organs. Nieva de la Hidalga et al. (2022) instead took a semantic segmentation approach, identifying SSDI regions at the pixel level.

This paper describes efforts to identify all components of a digital image of an herbarium specimen sheet by training a YOLOv5 object detection model on a subset of MELU SSDIs. As the building of this capacity is itself resource-intensive with respect to time, expertise and computational infrastructure, and smaller and medium-sized collections are regularly resource constrained, the key aim was to derive and share practical guidelines to enable other herbaria to integrate such a model into their digitisation workflow. As such, the specific research questions were:

1. Can a model be built to separately identify labels, handwriting and other original information, taxon annotation labels and other components of a specimen sheet digital image?
2. How many images must be annotated to train an effective model?
3. What is required to enable cross-herbarium application of the model, that is, how many new annotated images are needed to retrain a model for a new feature or collection?

| METHODOLOGY
To answer the first research question, an object detection model was built. The second research question was interrogated by varying the size of the training dataset and testing model parameters. The third research question involved testing how many additional marked-up images were needed to retrain the model to accurately identify a new feature.

| Choosing YOLOv5
It is usually less labour-intensive to mark up training data for an object detection model than for a semantic segmentation model. With this in mind, and taking into account the heterogeneity of the MELU SSDIs, the substantial number of images that would be required for any model, and the methods observed in the reviewed literature, an object detection model using YOLOv5 (https://github.com/ultralytics/yolov5) was chosen for this investigation (described further below). While a comparative study against other methods and models is a promising research area, the focus of this investigation was to comprehensively investigate and quantify the accuracy that could be achieved using this specific model type.
YOLO works through a single neural network to predict bounding boxes around objects and class probabilities for those boxes (Redmon et al., 2016). The model uses a series of convolutional layers to infer features from the whole image while reducing the size of the spatial dimensions. Detections of bounding boxes and class probabilities are made on the coarse spatial cells resulting from the convolutions, and predictions of the same object in multiple cells are resolved using non-maximum suppression. Enhancements were made to the model in the releases of YOLO9000 (Redmon & Farhadi, 2017) and YOLOv3 (Redmon & Farhadi, 2018). A Python implementation of this model using PyTorch, named YOLOv5, was released in 2020 (Jocher, 2020). This implementation of YOLO was used for this project for its convenience and flexibility. All YOLO training and validation were run on the University of Melbourne's high-performance computing infrastructure using four Intel Xeon E5-2650 CPUs and a single NVIDIA Tesla P100 GPU.
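For illustration, the following is a minimal sketch of training and validating a YOLOv5 'large' model with the settings reported in this study (640 pixels, 200 epochs), using the run() entry points of train.py and val.py in the ultralytics/yolov5 repository; the dataset definition file 'melu.yaml' and the batch size are assumptions, not values reported here:

```python
# Sketch: train and validate a YOLOv5 'large' model as in this study.
# Assumes the ultralytics/yolov5 repository is cloned and on the Python path.
import train  # ultralytics/yolov5 train.py
import val    # ultralytics/yolov5 val.py

train.run(
    data="melu.yaml",      # hypothetical dataset file: image paths + class names
    weights="yolov5l.pt",  # 'large' model type, COCO-pretrained starting weights
    imgsz=640,             # image size used in this study
    epochs=200,
    batch_size=16,         # an assumption; batch size is not reported above
)

# Validate the 'best' checkpoint against the held-out annotated SSDIs.
val.run(data="melu.yaml", weights="runs/train/exp/weights/best.pt", imgsz=640)
```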

| Phase 1. MELU-trained model
SSDIs from MELU were annotated. A subset of these images was used to train an object detection model, and the remaining SSDIs were used to validate the accuracy of the trained model. Training and validation were then undertaken on various-sized training datasets, also varying modelling parameters. The outputs are the MELU-trained 'sheet-component' model and recommendations for how many annotated images are required to train an effective model.
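As a minimal sketch of this partitioning (file paths and the random seed are assumptions), the annotated SSDIs can be split into the training and validation sets used for the full model (3371 and 1000 images respectively):

```python
# Sketch: reproducibly split the 4371 annotated SSDIs into training and
# validation sets, matching the 3371/1000 split used for the full model.
import random
from pathlib import Path

images = sorted(Path("melu_annotated").glob("*.jpg"))  # hypothetical directory
random.seed(42)                                        # assumed seed, for repeatability
random.shuffle(images)

val_images = images[:1000]     # held-out validation SSDIs
train_images = images[1000:]   # remaining 3371 SSDIs for training
```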

| Annotating MELU images
Both medium- and high-resolution MELU SSDIs were downloaded from the publicly accessible collection portal (https://online.herbarium.unimelb.edu.au/). In the machine learning context, to 'annotate' an SSDI is to mark up the image to identify the areas of interest. In contrast, then, to how the word 'annotation' is used in the herbarium curation field, here it is used to refer to the information produced by this marking-up exercise.
The MELU curator, together with the analytic team, determined the SSDI components, or areas of interest. The guiding principle of this part of the study was to maximise the potential value of the annotation exercise; therefore, all components on the SSDIs except for the biological specimen were annotated. In this way, the data could be made available for future (as yet unforeseen) summaries and investigations, and the object detection models for this investigation could be consolidated if the analysis suggested this was required. The component categories were: (1) institutional label; (2) data on the specimen sheet outside of a label ('original data', often handwritten); (3) taxon and other annotation labels; (4) stamps; (5) swing tags attached to specimens; (6) accession number (when outside the institutional label); labels produced as part of the MELU digitisation process, namely (7) small database labels, (8) medium database labels and (9) full database labels; and artefacts from the imaging process that do not remain with the specimen sheet: (10) swatch and (11) scale. Figure 2 shows two examples of annotations on MELU SSDIs. When a marked-up box is given one of the above names, it is usually called a label; however, given the context, these are referred to as component categories here. Often there was more than one instance of a component category on a single SSDI. In this paper, the phrase 'image-annotations' is used to refer to the set of annotations for a set of SSDIs, not the actual count of those annotations; that is, a total of 4371 image-annotations are available for use.
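For reference, YOLOv5 expects one text file of annotations per image, with one line per marked-up box giving the component category index and the normalised box centre and size. A minimal sketch follows, in which the ordering of the 11 category indices is an assumption:

```python
# Sketch: the 11 component categories and a parser for YOLO-format
# annotation lines ("class x_centre y_centre width height", all box
# coordinates normalised to the 0-1 range). Index order is assumed.
COMPONENTS = [
    "institutional label", "original data", "annotation label", "stamp",
    "swing tag", "number", "small database label", "medium database label",
    "full database label", "swatch", "scale",
]

def parse_annotation_line(line):
    """Return (category name, x_centre, y_centre, width, height)."""
    cls, xc, yc, w, h = line.split()
    return COMPONENTS[int(cls)], float(xc), float(yc), float(w), float(h)
```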
The annotation data were used to generate collection summaries to identify how common each component was on MELU SSDIs. These data were also used to locate the centre point of each of the SSDI components on the specimen sheets, using two-dimensional kernel density estimations (KDE) to create locative 'heat maps'. In total, 282 training-validation dataset combinations (detailed in Table A1) provided indications of the impact of significant SSDI heterogeneity, and guidance for determining how many images must be annotated to train an effective model.
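A minimal sketch of such a locative 'heat map', assuming the centre points for one component category have been extracted into an array of normalised (x, y) coordinates, using a two-dimensional Gaussian KDE from SciPy:

```python
# Sketch: two-dimensional KDE over the normalised centre points of one
# component category (e.g. 'institutional label'), rendered as a heat map.
import numpy as np
from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt

def heat_map(centres, title):
    """centres: shape (n, 2) array of (x_centre, y_centre) in 0-1 units."""
    kde = gaussian_kde(centres.T)                     # KDE expects shape (2, n)
    xs, ys = np.mgrid[0:1:100j, 0:1:100j]             # evaluation grid
    density = kde(np.vstack([xs.ravel(), ys.ravel()])).reshape(xs.shape)
    plt.imshow(np.rot90(density), extent=[0, 1, 0, 1], cmap="hot")
    plt.title(title)
    plt.show()
```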

| Assessing trained models
Measures used to evaluate the accuracy of the trained models were: (i) precision; (ii) recall; (iii) F1; (iv) mAP0.5; (v) mAP0.5–0.95; and (vi) the confusion matrix. These measures are well described elsewhere (e.g. Redmon et al., 2016), but as mAP0.5–0.95 is used as the key measure in this work, a brief description is worthwhile. Mean average precision (mAP) is effectively a combination of the precision and recall measures; it lies between 0 and 1, and the higher the value, the better the model. It measures the overlap between the actual and predicted object boundaries, that is, the 'intersection over union' (IoU). For example, mAP0.5 is the mAP where the boundaries overlap by at least 50%. Then, mAP0.5–0.95 is the average mAP for IoU thresholds between 50% and 95% in 5% steps. These measures were visualised using the web-based tool Weights and Biases (https://wandb.ai/). Each component category (e.g. 'institutional label', 'swatch') is assessed separately on these measures, and the overall model measures are an arithmetic average across the component categories. When assessing a trained model, YOLOv5 designates the 'best' epoch as that with the highest value of (0.1 × mAP0.5 + 0.9 × mAP0.5–0.95).
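To make the key measure concrete, the following sketch computes the IoU of two boxes and lists the ten IoU thresholds over which mAP0.5–0.95 averages; the final comment restates YOLOv5's 'best'-epoch criterion:

```python
# Sketch: intersection over union (IoU) for two boxes, and the ten IoU
# thresholds averaged by mAP0.5-0.95.
import numpy as np

def iou(a, b):
    """IoU of two boxes, each given as (left, top, right, bottom)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))   # overlap width
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))   # overlap height
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

thresholds = np.linspace(0.5, 0.95, 10)  # 0.50, 0.55, ..., 0.95

# YOLOv5's 'best'-epoch fitness criterion, restated:
# fitness = 0.1 * mAP0.5 + 0.9 * mAP0.5_0.95
```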

| Phase 2. Applying the sheet-component model to unseen images
The purpose of Phase 2 was to go some way towards answering the third research question by testing the MELU-trained model on unseen SSDIs. SSDIs from the nine herbaria represented in the Dillen et al. (2019) benchmark dataset were annotated using the same component categories; some of these SSDIs contained component types not present on MELU sheets. It was not expected that the MELU-trained model would cope well with these components, as it was not trained on them. Examples of image-annotations are in Figure 3. The annotation data were also used to locate the centre points of SSDI components, for comparison to the MELU SSDI heat maps.
The MELU-trained model was initially tested using annotations from each of the nine herbaria separately and then tested against the combined set of benchmark dataset image-annotations. The heterogeneity of the SSDI components and layouts from each herbarium meant an 'overall' result was less useful than the individual results.
Precision, recall, mAP0.5 and mAP0.5-0.95 along with the confusion matrix were used for the assessment of model accuracy.
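A minimal sketch of this validation loop, assuming one hypothetical dataset definition file per herbarium plus a combined file, again using the val.py entry point from ultralytics/yolov5:

```python
# Sketch: validate the MELU-trained weights against each benchmark
# herbarium's annotations separately, then against the combined set.
# Dataset file names and the weights file name are hypothetical.
import val  # ultralytics/yolov5 val.py

for i in range(1, 10):  # the nine benchmark herbaria
    val.run(data=f"benchmark_herbarium_{i}.yaml",
            weights="melu_best.pt", imgsz=640)

val.run(data="benchmark_combined.yaml", weights="melu_best.pt", imgsz=640)
```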

| Only using new annotations
The purpose of this group of tests was to determine whether retraining the MELU-trained model only on the additional image-annotations, without including the full MELU training dataset, could be as effective for developing an accurate model. The expectation was that these tests would be faster and, therefore, more practical for other herbaria if the results were comparable. When the whole set of annotations was split between the training and validation datasets, the proportions across each component were checked to ensure the two datasets were not biased. As demonstrated in Table 2, the proportions (the '% of annotations' columns) are similar, as is the average count of annotations per SSDI.
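In practice, such retraining amounts to restarting training from the MELU sheet-component weights on a small new dataset, in line with the finding that as few as 50 epochs on 30 additional annotated images can suffice. A minimal sketch, in which 'new_component.yaml' and the weights file name are assumptions:

```python
# Sketch: retrain from the MELU-trained weights using only the new
# image-annotations (e.g. ~30 images for 50 epochs).
import train  # ultralytics/yolov5 train.py

train.run(
    data="new_component.yaml",  # hypothetical file describing only the new annotations
    weights="melu_best.pt",     # start from the MELU sheet-component weights
    imgsz=640,
    epochs=50,
)
```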
The 'heat maps' for the centre of the institutional (left) and annotation (right) labels are presented in Figure 4.

| Phase 1: Testing trained models
Early in the testing regime, it was found that the 'large' YOLOv5 model type produced better models than the 'medium' model type with minimal time trade-off. It was also found that running on 1280 pixels took more than three times longer than running on 640 pixels (e.g., while specific to the infrastructure used in this study, a 'large' model …). Additionally, components with good overall predictability in the full model (per mAP0.5–0.95 in Figure 5; for example, 'scale', 'institutional label') showed less variability across all training dataset sizes than the poorly predicted components (e.g. 'number'). The 'heat map' of centre points for institutional (left) and annotation (right) labels for the SSDIs in the benchmark dataset is shown in Figure 8 and enables comparison to placement in the MELU SSDIs (Figure 4).

| Phase 2: Applying the MELU model to unseen SSDIs
Validating the revised MELU-trained object detection model against each of the benchmark datasets produced different results for each herbarium in the benchmark dataset (Figure 10); … adding 20 new image-annotations performed better than adding 30 for Berlin and Kew. Table 6 lists the four model assessment measures for all key models in this analysis. Note that 'swing tag' is excluded from all outputs in Phases 2 and 3. The measures for models including 'swing tag', and for 'institutional label' only, are included in Tables A2 and A3 respectively.

| DISCUSSION
The results of this study, explored in more detail in this section, demonstrate that an effective object detection model has been built to identify components of SSDIs. While trained on MELU digitised images, it is shown to be reasonably transferable to SSDIs from other herbaria. The predictive accuracy was further improved by retraining the MELU model with new image-annotations.

| Phase 1: MELU annotations
On average there were 5.6 annotated components per MELU SSDI (Table 1). Almost all SSDIs have a 'swatch' and 'scale'. SSDIs without an 'institutional label' instead had one of the three MELU digitisation labels. Approximately 28% of the annotated MELU SSDIs have one or more taxon annotations, and just over 30% have handwriting present on the specimen sheet; this information alone informs prioritisation of future steps to read data from these SSDI components.
As is standard in curation protocols, institutional and annotation labels were consistently placed in the lower right corner of the specimen sheet (Figure 4). This reflects that many of the MELU SSDIs annotated for this research had been remounted prior to digitisation, with consistent instructions for the positioning of components.

| Value of annotation task
The initial image annotation work represents the largest resource investment required to build a model such as this …

| Phase 2: Applying to new images without training
The concentrated locations of institutional and annotation labels noted in the MELU SSDIs (Figure 4) were also seen in the SSDIs from the Dillen et al. (2019) study (Figure 8). While the lower left corner of the specimen sheet is also commonly used for both label types, there is more variability in overall placements (as expected, given these are results across different herbaria), particularly for 'annotation label'.
When the revised MELU-trained sheet-component object detection model was applied to the benchmark image-annotations (without retraining the model), the results varied across the nine herbaria and uphold the basic object detection tenet that a model works best on components close to those it was trained on. Referring to Figure 9, the transferability of 'institutional label' and 'annotation label' was satisfying, though it was noted that some 'annotation labels' are little more than free-hand text on unformatted paper, which is difficult to distinguish from 'original data'. It should also be noted that the SSDIs selected from the benchmark dataset, and annotated for this investigation, were chosen without consideration of how the specimens were ordered in that dataset. While all of the specimens met the requirements of the Dillen et al. (2019) study, specimens from each participating herbarium varied significantly, for example, in the label or stamp types present, the placement of the labels or stamps, and the format (typed or handwritten) and arrangement of the data on the label.
Therefore, a different selection of SSDIs from the benchmark dataset will result in different model outcomes.
That said, it can be asserted that the revised MELU-trained sheet-component object detection model could be directly applied to new SSDIs not from MELU to identify and locate sheet components, and would predict reasonably well, particularly for the 'institutional label'. As for all models, though, targeted retraining could be conducted to improve outcomes (covered in the next section).
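As a minimal sketch of such direct application, the shared weights can be loaded through the YOLOv5 torch.hub interface and run over a new SSDI; the file names here are hypothetical:

```python
# Sketch: apply the shared MELU sheet-component weights to an unseen SSDI.
import torch

model = torch.hub.load("ultralytics/yolov5", "custom", path="melu_best.pt")
results = model("unseen_ssdi.jpg")
boxes = results.pandas().xyxy[0]  # one row per detected component
print(boxes[["name", "confidence", "xmin", "ymin", "xmax", "ymax"]])
```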

| Phase 3: Applying to new images or components with retraining
Adding new image-annotations to the full MELU training dataset resulted, in most cases, in better predictions than using the un-retrained MELU model alone. The differences between the two validation sets … The 'scale' results demonstrate even more clearly the improvement that minor retraining has on predictions (Figure 11, right): adding as few as 10 new image-annotations raised mAP0.5–0.95 …

| Further work
The research team has incorporated the MELU-trained sheet-component model into the MELU digitisation workflow. Further, such machine-driven component identification, particularly when focussed on labels and integrated with text reading, has the potential for application to many kinds of collections that have initiatives focussed on the digitisation of data stored on pro-forma object or specimen labels.

DATA AVAILABILITY STATEMENT
With the intent to contribute to the research of other herbaria and supporting research teams, the following assets and outputs from this research are made available on the condition of (a) full citation …